06. Data & Subsampling

Subsampling equation

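P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}

Here P(w_i) is the probability of discarding the word w_i, t is a chosen threshold, and f(w_i) is the frequency of w_i in the text (its count divided by the total number of words).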

For the quiz question below, consider this data:

  • We have a text with 1 million words in it
  • The word "learn" appears 700 times in this text

Given a threshold, t = 1*10^-4 (or 0.0001), what is the probability that we will discard the word "learn"?

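SOLUTION: 62%

The frequency of "learn" is f = 700 / 1,000,000 = 0.0007, so P = 1 - \sqrt{0.0001 / 0.0007} ≈ 1 - 0.378 = 0.622, or about 62%.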
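
The quiz arithmetic can be checked with a short Python sketch. The function name `discard_probability` and its parameter names are hypothetical, chosen only for illustration. Clamping negative values to 0 for words rarer than the threshold is an assumption, not part of the equation above.

```python
import math


def discard_probability(word_count: int, total_words: int, threshold: float) -> float:
    """Return P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) = word_count / total_words."""
    frequency = word_count / total_words
    # For words rarer than the threshold the formula goes negative;
    # clamping to 0 (never discard) is an assumption made for this sketch.
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


# Quiz values: "learn" appears 700 times in a 1,000,000-word text, t = 1e-4.
p_learn = discard_probability(word_count=700, total_words=1_000_000, threshold=1e-4)
print(f"{p_learn:.1%}")  # prints 62.2%
```

Raising the word count pushes the probability toward 1, so very frequent words are discarded most of the time, while words with frequency at or below t are always kept under the clamping assumption.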